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Abstract 


A blind separation problem where the sources are not independent, but have variance dependencies 
is discussed. For this scenario Hyvärinen and Hurri (2004) proposed an algorithm which requires 
no assumption on distributions of sources and no parametric model of dependencies between com- 
ponents. In this paper, we extend the semiparametric approach of Amari and Cardoso (1997) 
to variance dependencies and study estimating functions for blind separation of such dependent 
sources. In particular, we show that many ICA algorithms are applicable to the variance-dependent 
model as well under mild conditions, although they should in principle not. Our results indicate 
that separation can be done based only on normalized sources which are adjusted to have station- 
ary variances and is not affected by the dependent activity levels. We also study the asymptotic 
distribution of the quasi maximum likelihood method and the stability of the natural gradient learn- 
ing in detail. Simulation results of artificial and realistic examples match well with our theoretical 
findings. 


Keywords: blind source separation, variance dependencies, independent component analysis, 
semiparametric statistical models, estimating functions 


1. Introduction 


Blind methods of source separation have been successfully applied to many areas of science 

(e.g. Hyvärinen et al., 2001b; Olshausen and Field, 1996; Makeig et al., 1997; Vigario, 1997; 
Ziehe et al., 2000; Thi and Jutten, 1995; Cardoso, 1998a; Parra and Spence, 2000; Cardoso, 2003; 
Meinecke et al., 2005). The basic model assumes that the observed signals are linear superpo- 
sitions of underlying hidden source signals. Let us denote the n source signals by the vector 
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s(t) = (s,(t),-..,5n(t))', and the observed signals by x(t) = (x1(t),...,Xm(t))'. In this paper,! 
we will focus on real-valued signals. The mixing can be expressed as the equation 


x(t) =As(t), 


where A = (aj;) denotes the mixing matrix. For simplicity, we consider the case where the number 
of source signals equals that of observed signals (n = m). Both the sources s(t) and the mixing 
matrix A are unknown, and the goal is to estimate them based on the observation x(t) alone. 

In most blind source separation (BSS) methods, the source signals are assumed to be statisti- 
cally independent. Blind source separation based on such a model is called independent component 
analysis (ICA). By using non-Gaussianity of the sources, the mixing matrix can be estimated and 
the source signals can be extracted under appropriate conditions. There are also further approaches 
of BSS, that are, for example, based on second-order statistics and algorithms exploiting nonstation- 
arity. The second-order methods are applicable to the case where the source signals have (lagged) 
auto-correlation. Provided that components have nonstationary, smoothly changing variances, the 
model can be estimated as well by algorithms based on nonstationarity of signals. 

Among many extensions of the basic ICA models, several researchers have studied the case 
where the source signals are not independent (for example, Cardoso, 1998b; Hyvarinen et al., 2001a; 
Bach and Jordan, 2002; Valpola et al., 2003, see also references in Hyvärinen and Hurri, 2004). The 
dependencies either need to be exactly known beforehand, or they are simultaneously estimated by 
the algorithms. Recently, a novel idea called double-blind approach was introduced by Hyvärinen 
and Hurri (2004). In contrast to previous work, their method requires no assumption on the distri- 
butions of the sources and no parametric model of the dependencies between the components. They 
simply assume that the sources are dependent only through their variances and that the sources have 
temporal correlation. In the Topographic ICA (Hyvärinen et al., 2001a), the dependencies of the 
sources are also caused only by their variances, but in contrast to the double blind case, they are 
determined by a prefixed neighborhood relation. It should be noted that for such dependent compo- 
nent models identifiability results have not been theoretically established so far, while identifiability 
of multidimensional ICA was proven by Theis (2004). 

A Statistical basis of ICA was established by Amari and Cardoso (1997). They pointed out that 
the ICA model is an example of semiparametric statistical models (Bickel et al., 1993; Amari and 
Kawanabe, 1997a,b) and studied estimating functions for it. In particular, they showed that the quasi 
maximum likelihood (QML) estimation and the natural gradient learning give a correct solution re- 
gardless of the true source densities which satisfy certain mild conditions. In this paper, we extend 
their approach to the BSS problem considered in Hyvärinen and Hurri (2004). Investigating esti- 
mating functions for the model, we show that many of ICA algorithms based on the independence 
assumption can achieve consistent solutions in a local sense, even if there exist variance depen- 
dencies, which is astonishing and seems somewhat counterintuitive. We remark that estimating 
functions are concerned with local consistency (consistency’ will denote its local version in the 
following) and in general have spurious solutions. For a few algorithms, even global consistency 
has been proven by different principles (for example, Hyvärinen and Hurri, 2004). Nevertheless, 
our result goes beyond existing ones, because it covers most types of BSS algorithms and can give 
asymptotic distributions. The main message of this paper is that most ICA algorithms can be proven 
to be consistent in our framework although the data is not independent. So they must effectively 





1. This is an extended version of Kawanabe and Miiller (2004) presented at ICA2004. 


454 


ESTIMATING FUNCTIONS FOR VARIANCE-DEPENDENT BSS 


use some concept beyond independence. Thus our consistency results indicate that separation can 
be done based only on normalized sources which are adjusted to have stationary variances and is 
not affected by the dependent activity levels. 

This paper is organized as follows. At first, we define the variance-dependent model in Section 
2 and explain estimating functions, a useful tool for discussing semiparametric estimators in Section 
3. In Section 4, we discuss the relation between estimating functions for the ICA model and those 
for the variance-dependent BSS model in general. It is shown that these algorithms work properly, 
even if there exist spatiotemporal variance dependencies. Among several ICA algorithms, the quasi 
maximum likelihood method and its online version, the natural gradient learning are discussed in 
detail. We study the asymptotic distributions of the quasi maximum likelihood method (Section 
5.1) and the stability of the natural gradient learning (Section 5.2). We also give a brief summary 
about several other ICA algorithms from our viewpoint in Section 5.3. Detailed discussion can 
be found in Appendix A. The theoretical insights are underlined by several numerical simulations 
in Section 6. In particular, we carried out two experiments, where we extract two speech signals 
with high variance dependencies. It is sometimes believed that ICA algorithms work for mixture of 
acoustic signals or natural images because the data are sparse and often disjoint. Our results show 
that they can also separate even highly coherent signals, and our theoretical analysis can thus help 
to understand the reason. 


2. Variance-Dependent BSS Model 


Hyvärinen and Hurri (2004) formalized the probabilistic framework of variance-dependent blind 
separation. Let us assume that each source signal s;(t) is a product of non-negative activity level 
v;(t) and underlying i.id. signal z;(t), that is, 


silt) = v;(t)zi(t). (1) 
We remark that the sequences of the vectors s = (s1, ...,Sn) , V = (V1, ---,Vn)| and z= (Z1, ---,Zn)! 
are considered as multivariate random processes in this paper. In practice, the activity levels v;(t) 
are often dependent among different signals and each observed signal is expressed as 


n 
xilt) = $ ajjvj(t)zj(t), i=1,...,n, 
j=l 


where v;(t) and z;(t) satisfy: 
(i) v;(t) and z;(t’) are independent for all i, j, t and f’, 
(ii) each z;(r) is i.i.d. in time for all i, the random vector z = (z1,-..,Zn)' is mutually independent, 
(iii) z;(t) have zero mean and unit variance for all i. 


No assumption on the distribution of z; is made except (iii). Regarding the general activity levels v;’s, 
v(t) and v;(t) are allowed to be statistically dependent, and furthermore, no particular assumption 
on these dependencies is made (double blind situation). We refer to this framework as the variance- 
dependent BSS model in this paper. Figure 1 shows an example of the sources s used in the model. 
As stated in the assumption (ii) above, the normalized signals zı and z2 are mutually independent. 
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source (51,52) activity level (v1, v2) normalized signal (z1,z2) 
























































Figure 1: Sources (51,52) with variance dependencies in the variance-dependent BSS model. In the 
middle panels both v; and —v; are plotted for clarity. 


However, since the sequences z; and zz are multiplied by extremely dependent activity levels vı and 
v2, respectively, the short-term variance of the source signals sı and sz are highly correlated. 
Hyvärinen and Hurri (2004) proposed an algorithm which maximizes the objective function 


J(W) = X [60v ({w; u(t)]?, bw; u(t —Ar))°)P, 
ij 

where Cov denotes the sample covariance, W = (w1,... Wal? is constrained to be orthogonal and 
where u(t) is obtained by preprocessing the signal x(t) by spatial whitening. It was proved that 
the objective function J is maximized when WA equals a signed permutation matrix, if the matrix 
K = (K;ij) = (cov{s? (t), 8 (t — At) } ) is of full rank. This method shows good performance as long 
as there exist temporal variance dependencies and the data is not spoiled by outliers (see Meinecke 
et al., 2004). 

It is important to remark that the nonstationary algorithm by Pham and Cardoso (2000) was 
also designed for the same source model (1), except that v;(t)’s are assumed to be deterministic and 
slowly varying. However, it is straightforward to show validity of this algorithm, when v;(t)’s are 
(slowly-varying) random sequences. 


3. Semiparametric Statistical Models and Estimating Functions 


Amari and Cardoso (1997) established a statistical basis of the ICA problem. They pointed out that 
the standard ICA model? 


T n 
p(X|B,ps) = |detB|" I J [ [ps {b; x0} (2) 
t=1i=1 
is an example of semiparametric statistical models (Bickel et al., 1993; Amari and Kawanabe, 
1997a,b), where B = (b,...,b,)' = A`! is the demixing matrix to be estimated and p,(s) = 
Tl Ps,(s;) is the density of the sources s. Notations used in the following sections are also sum- 
i=l 


marized in Table 1. As the function p, in this model, semiparametric statistical models contain 
infinite dimensional or functional nuisance parameters which are difficult to estimate. Moreover, 
they even disturb inference on parameters of interest. 





2. Since the sources are assumed to be i.i.d. in time, people consider the distribution of one sample x instead of the 
entire sequence X. 
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x(t) = Go" observed data at t 

X = (x(1),...,x(T)) whole sequence of the observed data 

s(t) = (s1 (t), ---, Xn (t) ) T source signals at t 

v(t) = (vi (t), -- -Vn (t) ye general activity levels of the sources s(t) 

V = (v(1),...,v(T)) whole sequence of the activity levels 

z(t) = (z1 (t), ---,Zn(t))T normalized source signals by the activity levels v(t) 

n xn mixing matrix 

B = (bij) = (bi, si)" demixing matrix which is equivalent to A~! 

pz(z) = i Pz; (zi) density of the normalized source signals z 

pv(V) density of the entire sequence V = (v(1),...,v(T)) 
of the activity levels 

y(t) = Bx(t) extracted sources by the demixing matrix B 

F (x,B) or F(X,B) estimating function which is an n x n matrix-valued 
function of the data and the parameter B 

vec(F) vectorization operator 

= (Fy,---sFaty--0sFiny-++s Fan)! 














Table 1: List of notations used in the variance-dependent BSS model 


In the variance-dependent BSS model which we consider, the sources s(t) are decomposed of 
two components, the normalized signals z(t) = (z(t),...,Z,(t))' and the general activity levels 
v(t) = (vi(t),-.-,Vn(t))'. Since the former has mutual independence like the ICA model, the den- 
sity of the data X is factorized as 





T n T 
p(XIV:B.p:) = (aera |” TJ] T -7 P E ) | B 


t=1i=1 


when V = (v(1),...,v(7)) is fixed. Therefore, the marginal distribution can be expressed as 


P(X|B,P-, Pv) = | rXIV:B,p.)pv(Vay, (4) 


where the density py of V becomes an extra nuisance function. 

In order to construct valid estimators for such semiparametric models, estimating functions were 
introduced by Godambe (1976). Let us consider a general semiparametric model p(x|6,p), where 
O is an r-dimensional parameter of interest and p is a nuisance parameter. An r-dimensional vector 
valued function f(x,6) is called an estimating function, when it satisfies the following conditions 
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for any 8 and p (Godambe, 1991), 
E[ f(x, 8) |9, p] =0, 
æo] #0, where 0=E| Z f,0) feo], 
E [If Œ,0)|7 [0,p] <=, 


where E[-|8,p] denotes the expectation over x with the density p(x|0,p) and ||- || is the Euclidean 
norm. Suppose i.i.d. samples x(1),...,x(T) are obtained from the model p(x|0*,p*). If such a 
function exists, by solving the estimating equation 


f(=(t),8) =0, (5) 


Ms 


t=1 


we can get an estimator @ with good asymptotic property. Such an estimator that is a solution of an 
estimating equation as (5) is called an M-estimator in statistics (Huber, 1981). It can be regarded 
as an extension of the maximum likelihood method for parametric models. The M-estimator @ is 
consistent regardless of the true nuisance parameter p*, when the sample size T goes to infinity. 
Moreover, it is asymptotically Gaussian distributed, that is, ~N (0*, Av), where Av denotes the 
asymptotic variance computed by the following equation 


Av = Av(0*,p*) = =o E | f,0) FT (2,8) 





ep] (a), 


and Q = Q(8*,p*) =E J f(x, 8) | 0"; p*| . We remark that the asymptotic variance Av depends on 


the true parameters (0*,p*), but not on the data x(1),...,x(7). As we will explain in Section 4.2, 
notions of estimating functions and M-estimators were extended to non i.i.d cases. 

Although estimating functions are useful for semiparametric models, it is non-trivial to find such 
functions. Amari and Kawanabe (1997a,b) studied this problem from a geometrical point of view 
and gave a guideline for discussing estimating functions. i 

The asymptotic result guarantees theoretically that the estimator O derived from the estimating 
function converges to the true parameter 6* under mild conditions. However, we should remark 
that the asymptotic variance Av of the estimator depends on the true nuisance parameter p*. For 
example, when the matrix Q is almost singular at p*, it can happen that the asymptotic variance 
Av becomes very large. This may cause some practical problem, that is, the estimate from finite 
samples can be no longer close to the true parameter. We will revisit this issue in Section 6. 

Furthermore, online algorithms with similar consistency property can also be constructed from 
estimating functions, 


O41 = 0-1 f(x(t),9), (6) 
841 = 98,—1,R(8,) falt), 91), (7) 


where R(8) is an n x n nonsingular matrix and depends only on 8. We remark that the functions 
f(x,0) and R(6) f(x,®) give the same estimating equation, if R(®) has the inverse matrix and does 
not depend on the data x. Such functions are called equivalent estimating functions. It is also easy to 
see that the online algorithms (6) and (7) have the same equilibria points. However, their dynamics 
are different. The stability of such online learning was investigated by Amari et al. (1997). 
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4. General Properties of Estimating Functions for Blind Separation 


In this section we will at first review estimating functions for the ICA model (2) (see also Amari and 
Cardoso, 1997; Cardoso, 1997) and then discuss our contribution, that is, by defining estimating 
functions for the variance-dependent BSS model (3) and (4). 


4.1 Estimating Functions for Ordinary Blind Source Separation 


In case of the ICA model, the parameter of interest is the n x n matrix B = A~! and hence it is con- 
venient to write the estimating functions in n x n matrix form F (x,B). The conditions of estimating 
functions are reshaped accordingly as 


E| F (x,B) |B, ps] =0, (8) 
dvec{ F (x,B)} 
det here Q=E| ZEU WI Bpl, 
| e Q| # 0, where Q | ðvec(B) Ps (9) 
E [ ||F(x,B)||z |B,Ps] < %, (10) 
where vec(F) = (Fi1,...,Fn1;---,Fin,---, Fm) is the vectorization of matrices and ||- ||- denotes 


Frobenius norm. It should be noted that both in usual ICA models and in the variance-dependent 
BSS model, scales and orders of the sources cannot be determined, that is, two matrices B and PDB 
indicate the same distribution, when P and D are a permutation and a diagonal matrix respectively 
(Comon, 1994). Therefore, we can find any matrix in the equivalence class, so for notational 
convenience we will fix scales as the constraints (25) later.* 

One of the standard ICA algorithms originates from maximum likelihood estimation, which is 
asymptotically the best method if the density p, is known. Because in the semiparametric model ps 
is unknown and difficult to estimate, the idea is to use instead the maximum likelihood estimation 
under a prefixed density ps. The method is called the quasi maximum likelihood estimation, since 
the fixed Ps does not coincide with the true one. The estimator B is derived from the equation 


T 


È [7-97] =0, an 


t=1 
where y(t) = Bx(t) is the estimator of the sources, @(y) = (@1(y1),---5@n(vn))! and 
d as 
Qi(yi) = dy; log Ps; (yi). 
For the nonlinear function @;(y;), 
iyi) = tanh(cy;),  c¢>0, (12) 
P0) = y, (13) 


are often employed. The function F(x,B) = I —@{Bx}(Bx)' in (11) is an example of estimating 
functions for the ICA model, provided that it satisfies (9) and (10). It is trivial to show that it fulfills 
(8). Another example is the function 


F (x,B) = Bx(Bx)' —I + (Bx) g' (Bx) — g(Bx) (Bx)! 





3. It is clear that the variance-dependent BSS model has at least such indeterminacy. On the other hand, the identifiability 
in this case has not been proved so far. 
4. We ignore the permutation indeterminacy P, since it’s locally not problematic. 
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for FastICA (see (37) in Appendix A.1), where g(-) is a vector valued non-linear function as @(-). 
We remark that this procedure can also be derived from minimum mutual information (Yang and 
Amari, 1997) and infomax principle (Bell and Sejnowski, 1995). 

In general the quasi maximum likelihood estimator is no longer consistent because of misspec- 
ified distribution. However, in the ICA model (2), Amari and Cardoso (1997) found that the quasi 
maximum likelihood method and its online version (the natural gradient learning) give an asymp- 
totically consistent estimator, provided that F(x,B) = I — @{Bx}(Bx)' satisfies (9) and (10). In 
particular, we remark that the assumed distribution ps is not equal to the true one. This research has 
motivated us to investigate also such semiparametric procedures for the variance-dependent BSS 
model (3) and (4). In particular, we will show in Section 5.1 that the quasi maximum likelihood 
method (11) still gives a consistent estimator even under this extended situation. 


4.2 Estimating Functions for Variance-Dependent Blind Source Separation 


In the variance-dependent BSS model, in contrast to the ICA model studied by Amari and Cardoso 
(1997), the data sequence X = (x(1),...,x(7)) is not i.i.d. in time, but might have time depen- 
dencies. Therefore, we have to consider more general functions F(X ,B) of the whole sequence X. 
General estimating functions F (X,B) must satisfy 


E[F (X,B) |B, Pz, Pv] =0, (14) 
| Ovec{F(X,B)} 

| det Q| #0, where Q=E | MSH [5.9.01], (15) 

E[||F(X,B)|lz |B, P-Pv] <=, (16) 


for all (B,p,, Py). An M-estimator B can be derived from the estimating equation 
F(X,B) =0. (17) 
Suppose that the data X is subject to p(X |B*,pž, pý ) defined by (3) and (4). 


Theorem 1 Jf the function F(X ,B) satisfies the conditions (14) — (1 6) and appropriate regularity 
conditions such as Condition 2.6 in Sørensen (1999), the M-estimator B derived from the equation 


~N 


(17) is asymptotically Gaussian distributed vec(B) ~ N(vec(B*), Av), where 


Av = Av(B*,p?,p7) = Q E(Q')', (18) 
E = E(B*,p}, pý) = E [vec{F (X,B*)} vec{F (X, BY |B*, pp] 
Q m Q(B PPY) = p ee B 02.0% g 





Proof See Sørensen (1999). 


Now, we investigate the relation between estimating functions for the ICA model and those for 
the variance-dependent BSS model. Let F (x,B) be an estimating function for the ICA model. In 
the ICA context it is often the case that such estimating functions satisfy 


E[Fij(x, DB) |B, Ps] =0, iF j, (19) 
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for any diagonal matrix D, that is, its off-diagonal parts (19) hold for all matrices equivalent to B. 
The scale factor D is determined usually by the diagonal parts of condition (8) 


E[ F(x, B) |B, ps] = 0, 


in the ordinary ICA model. We will soon present the equation for fixing the scale factor D in the 
variance-dependent BSS model. 
Let us consider the function 


T 
F(X,B) =) F(x(t),B), (20) 
t=1 


which is used in estimating equations for the ICA model. We can show that this function becomes 
a candidate of estimating functions for the variance-dependent BSS model. 


Proposition 2 The function F(X ,B) defined in (20) satisfies condition (14), provided that F (x,B) is 
an estimating function for the ICA model and fulfills (19). Furthermore, if the additional assumption 


E [| ||F(x(),B)||- |B,Pa py] <<, Ve (21) 


holds, condition (16) is also satisfied. 
Proof Taking expectations of the off-diagonal terms of (14), we get 


T 
E | Fij(X,DB) |B,p.,pv] = E [EEL no0.08 |V;B, p4] 
t=1 





x 
x 


where P,),() is the density function of s(t) when its activity level is fixed at v(t), that is, 


Psy (s) = II T py { at) } 


i=l 


= E 


Mns 


E | Fi;(x(t), DB) |B, Psi] 
1 


Il 








We remark that the expectation E|-|V;B,pz] (E[-|B, Psj,(r)] ) is taken over z(t) (resp. s(t)) under fixed 
activity levels V, while E]-|py]| denotes the expectation over the activity level V. Because (19) holds 
for any ps, we can prove 

E | Fij(X, DB) |B, Pa Pv] =0, 


for all diagonal matrices D. If we select the scale factor D such that the diagonal terms 
E[Fii(X,B) |B, Pz, Pv] =0 


hold, F satisfies the unbiasedness condition (14). We furthermore note that this scaling is different 
from that in the ICA model presented before, and the expectation E | Fi (x(t), B) |B, Psi) | at each 
time ¢ can be non-zero in general. 





5. We remark that some of ICA/BSS algorithms (for example, TDSEP/SOBJ) are not based on estimating functions in 
this class. Because it is not easy to discuss them in such a general form, we deal with other classes separately in 
Appendix A. 
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The left hand side of Eq. (16) can be expressed as 
E[||F(X,B)|lz |B,P:,Pv] 


LE | wf (e(e),B) FT (x(),B)} |VsB, p:] 


by 
E i [ IF (x(t), B) lz |B, Psa] p 


B| Z ÈE[F: B) |B, Ps|v( n| E | Fi(x(t'),B) |B, Psia] 


tAt i= 

















by ; (22) 


where we used the fact that x(t) and x(t’) (t # t’) are independent for fixed V. From assumption 
(21), the first term of Eq. (22) is finite. 


LELEL IFO), B)\lz |B, Pspe)] [Pv] 


YE [IIF œ), B)? |B, Pz Pv] <% (23) 


We remark that condition (10) does not necessarily imply assumption (21). Let us define 
c(v(t)) = E [ IF (0), B)|IF |B; Pso)]. Since 





IE [ Fi(=(t);B) |B; psc] | < YEL FR(«().B) |B; Psr] < VECO), 


the second term of Eq. (22) (called r in the following) can be bounded as 


rl < n LE| Veo) Vee) ov] 





tAr 
2 
< o ie} 
< HERL )) [Pv]. 





Here we used Schwarz’s inequality twice. Because of Eq. (23), this bound is also finite. 











The basic idea of this proof is that the situation becomes similar to the ordinary ICA model, if 
the activity levels V are fixed. Unfortunately, the other conditions are difficult to be proven in this 
general form. For example, the second condition can be transformed in the similar way as 


| dvec{F (X,B)} 
dvec(B) 





B..P»| 


dvec{ F (x(t),B)} 


T 
LE E | dvec(B) [Boso 


ee ee |B, Paint) is non-singular, it may still be possible that the sum 








p : (24) 


Even if each term E [ 
(24) becomes singular. However this is in practice an extremely rare case. 
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5. Consistency Results for Variance Dependent Blind Source Separation Using the 
Estimating Function Framework 


We will use the estimating function framework to prove consistency results for (i) the quasi max- 
imum likelihood type methods (for example, Pham and Garrat, 1997; Bell and Sejnowski, 1995), 
(ii) the natural gradient learning for ICA (for example, Amari, 1998) and (iii) various other ICA al- 
gorithms such as FastICA (Hyvärinen and Oja, 1997), TDSEP/SOBI (Ziehe and Müller, 1998; Be- 
louchrani et al., 1997), *Sepagaus’ (Pham and Cardoso, 2000) and JADE (Cardoso and Souloumiac, 
1993). 


5.1 Asymptotic Distribution of the Quasi Maximum Likelihood Estimator 


In this section, it is shown that the quasi maximum likelihood method (11) as for example Pham and 
Garrat (1997); Bell and Sejnowski (1995) still gives a consistent estimator even under the extended 
model (3) and (4). For convenience, we fix the scales of the recovered signals as 


T 
E | E o1{b; x(t) } bi x(t) 
t=1 





Boom =T, (25) 


fori=1,...,n. Then (14) is automatically fulfilled for the diagonal terms. We remark that by this 
constraints the length of b;’s may depend on the nuisance parameters (p,,py), but this does not 
change the following discussion, because the scales can be fixed arbitrarily. 

Since the function F (x,B) = I — {Bx}(Bx)' obviously satisfies (19), we already know from 
Theorem 2 that the function 


T 


FML B) =) [1-900 | 


t=1 


satisfies the conditions (14) and (16) under the assumption 


E[o?Dilt)}y3@) |B, Pa Pv] <<, Vi it, (26) 


where y(t) denotes the extracted sources Bx(t). The additional assumption imposes mild restriction 
on the distribution of the activity levels V. For example, when the density py has extremely heavy 
tails, the left hand side of Eq. (16) becomes infinite, even if condition (10) is fulfilled. Thus, we 
need assumptions like (26) to exclude such unusual cases. 

For better understanding, we directly analyze the off-diagonal terms of (14) 
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E| E[9;{vi(t)zi(t)} |V;B,P-]E | v;(t)z;(t) |V;B,p-] |pv] = 0. 


ll 
a 


The second equality follows from the fact that z; and z; are independent for fixed vo 





6. This unbiasedness in fact holds under a wider condition E[s;(t)|s;(-), j 4 i] =0. 
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To prove condition (15) and compute the asymptotic variance (18), we calculate the n? x n? 


matrix Q. If we use the non-holonomic basis dy = dBB~! (Amari et al., 2000), Q is expressed as 


_ | dvec(FOM") 
a | dvec(%) 





B, pap B, 


where B = (Bijz) and Bij. = ikbij. Fortunately, the matrix E ene 


QML 
erect: | turns out to have a 


simple structure such that only the following 2n? — n components are non-zero, 
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in which we employed the following quantities 
ki{vi(t)} E[ oi{vi(t)zi(t)} |V;B, P2]; 
mi{vi(t) } vi(t) E[ o{vi(t)zi(t)} F(t) |V;B, p], 


and @; is the derivative of @;. Hence, it is not difficult to check non-singularity of this matrix, and if 
this is the case, the condition (15) holds. We can also explicitly calculate the inverse matrix 


= ae B(E dvec(FOML) -1 h aih A . 18). b ae 
OFS aren that appears in the asymptotic variance (18), because we only have to 
invert the 2 x 2 matrices. 


Finally, the variance of the estimating function can be computed as 
QML QML 
E [FRM FB, p.. pv | 
Y'cov [gif{yi(t) }yi(t), gtt), i=j, k=! 
t,t! i 
= 4 VELoiM} aD], j=l, iżjork#l 
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VElotyi) bye], i=l, j=k,i# j 
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which is slightly more complicated than the standard ICA model. Summing up the discussion above, 
we get the following theorem. 


Theorem 3 Suppose that the conditions 
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and assumption (26) hold. Then the function FÈ”! (X B) satisfies the conditions (14) — (16) and 
becomes an estimating function. In that case, the quasi maximum likelihood estimator BML de- 
rived from the equation F2M" (x Bw) = 0 is consistent regardless of the true nuisance functions 
(p2, Py) under appropriate regularity conditions. 


5.2 Stability of the Natural Gradient Learning 


In neural networks and machine learning, online leaning is often preferred to batch learning because 
of computational efficiency, less memory and adaptability (see, for example, Miiller et al., 1998; 
Murata et al., 2002). The natural gradient learning (Amari, 1998) 


B(t +1) = Bit) +n(¢) I -epOW O B(t), (29) 


is an online algorithm based on the quasi maximum likelihood method, where y(t) = B(t)x(t) is the 
current estimator of the sources and n (t) is an appropriate learning constant. 

Following the discussion in Amari et al. (1997), we will study the stability of the natural gra- 
dient learning for the variance-dependent BSS model. For the sake of simplicity, they analyzed a 
continuous version of the algorithm (29) 


Bl) = ult) -pOT O| BO, (30) 


where B(t) denotes time derivative of the matrix B(t), u(t) = (t)/t and t means the sampling 
period. Suppose that the marginal distributions of the activity levels v(t) are identical in time. 
For example, when the sequence V is generated from an AR process, this holds approximately 
after it reaches the equilibrium distribution. Although the random variables v(t)’s (activity levels) 
have an identical marginal distribution in time, their realization can fluctuate from time to time and 
weak nonstationary structures can be found in the observed signals. Unfortunately, it is difficult to 
eliminate this rather strong assumption. If we apply the online algorithm (29) to data with highly 
nonstationary variances like speech, the scale factor of the demixing matrix B changes substantially 
from time to time and never converges. This makes the current stability analysis impossible. It 
might be possible to discuss these cases by considering only the equivalence class, but it is out of 
the scope of the current paper. 
In order to fix the scales of the sources, we impose constraints 


E [ob x(1)} bj x(t) =1, Wi. 81) 


Note that the marginal distribution of x(t) is identical in time t and the equilibrium points Bo of the 
equation (30) satisfy 


E[r- efyol(t)}99 ()] =0, (32) 
where yo(t) = Box(t). With a similar calculation as in Section 5.1, we can show that the function 
FNS(x,B) =I—9(y)y! 


satisfies the unbiasedness condition (8) of estimating functions. This means that the true demixing 
matrix B* satisfies the equilibrium equation (32), that is, B* becomes an equilibrium point of the 
flow (30). However, it does not guaranteed that B(t) converges to B* even locally. 
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Let us fix the stochastic process V = {v(t), t > 0} of the activity levels at first and consider the 
conditional expected version of the learning equation 


Bl) =n) El-p O |v] BO. 
By linearizing it at the equilibrium point B*, we have the variational equation 


dvec{ E [FNS (x(t), B*) |V] B*} 


vec{SB(t)} = u(t) dvec(B) 





vec{dB(t)}, 


where 5B(t) is a small perturbation. Therefore, we have to check the eigenvalues of the operators 

dvec{ E| FNS (x(t),B*) |v] 
dvec(B) 

rium B* is asymptotically stable for the fixed activity levels V. Since the matrix can be expressed 


as 


z9 for each ż > 0. If all eigenvalues have negative real parts, then the equilib- 





dvec{ E | FNS (x(t), B*) |V] B*} = dvec (E[ FNS|V] ) 
dvec(B) = d vec(X) 


where B* = (Bi. i) = (Sixb7;), and derivative w.r.t. x corresponds to the non-holonomic basis 
dvec( E[FNS|V] ) 


(B), (33) 








dy = dBB~'. Because the left hand side of (33) is a similar transformation of , their 


eigenvalues are the same. Fortunately, as is the case of the quasi maximum likelihood, the matrix 
dvec( E[FNS|v]) 


Byes) has a simple structure such that only the following 2n? — n components are non-zero, 


dE [FSV] 
OX i 
dE[F,°|V)  OE[FRS|V] 
OX ij dX ji = -( ki{vi(t)} v5 (t) 1 ) 
QE[FNS|V]  dE[FNS|\v] 1 kiiver) 
OXij OX ji 





= mi{v;(t)}—1 








dvec( E[FNS|V] ) 


Therefore, the matrix at time ¢ has eigenvalues only with negative real parts, if and 


: dvec(X) 
only if 
mi{vi(t)} +1 >0 (34) 
ki{vi(t) } >0 (35) 
VOO kitvi(t)}ki{vj()} > 1 (36) 


for all i, j(i Æ j). 


Theorem 4 Jf the stochastic process V = {v(t), t > 0} of the activity levels satisfies the conditions 
(34) — (36) with probability 1 as for the true parameter (B*,p}, pọ), then the true demixing matrix 
B* becomes an asymptotically stable equilibrium of the flow (30) with probability 1. 


Although asymptotic stability could be proved under weaker conditions, we summarize the 
discussion as Theorem 4 for simplicity. In order to understand the result better, we revisit the 
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examples presented in Amari et al. (1997). The conditions turn out to be much harder than those by 
Amari et al. (1997) because of the fluctuating activity levels. 


Example 1. Let us consider the following odd activation function 
gi(yi) = |yi/Psign(yi) 
for p = 1,2,.... The conditions (34) and (35) are automatically satisfied for any fixed v;(t) > 0. 
mO} = P OEH] >0 
k{vi(t}} = pvP'(t) E[lz(@)|?-"] > 0 
The condition (36) becomes 


P ORTO E [laO] EO] > 1. 


By introducing Gray’s norm 





Ejja] 
Yi = lel ail?) 


-1 
and taking notice of the normalization constraints (31), that is, E[|z;|?*!] = (El? K , finally we 


obtain 











P min vP (t) min weli) 
Ypipj < P +1 FI 
E ape 


For the cubic function @;(y;) = y;, not as in the ICA model, the condition that all signals are sub- 
Gaussian 


E[[zil4] 
(El lz)" 


is not enough, but the variation of activity levels v; from (1) should be taken into account. 


Example 2. Let us consider a symmetrical sigmoidal function 


i(yi) = tanh(Byy). 


The conditions (34) and (35) can be checked easily. Unfortunately, in this case we can only do a 
rather coarse analysis as follows. Let us assume B < 1 so that the approximation 


1 2 
Pii) ~ Byi — 3 (By) + 75 (By)? 


holds with high probability. Then, we can express the condition (36) as 


B°vF(6)v4(0) E| 1 — (Bri)? +Z Bv 





v| efi- Gy)? +5601 





v|>1. 


Because 1 — £? + 24/3 > t*/3, we get a stronger condition 


pt 
a vi (t) v5 (t) Eb v] E[y4|V] > 1. 
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2 


From a rough approximation of (31), the relation B ~ (Ely; ]) “| is derived. Therefore, if all approx- 


imations are accurate enough, we finally get a sufficient condition of (36) like 


3 
Vij > Z Elvi E[l 
ia B4 s min; 





minv 
t 


In contrast to the ordinal ICA model without variance dependence, the condition that all signals are 
super-Gaussian may not be enough, but each kurtosis y3; should be much larger than 3.7 


5.3 Properties of Other BSS Algorithms 


Although we concentrated on estimating functions of the form (20), we can deal with more general 
functions and investigate other ICA algorithms within the framework of estimating functions and 
asymptotic estimating functions (see also Cardoso, 1997). Such analysis helps to check whether 
these algorithms may give valid solutions regardless of the nuisance densities (p;,py). We re- 
mark that our extension enables us to analyze algorithms based on temporal structure such as TD- 
SEP/SOBI (Ziehe and Miiller, 1998; Belouchrani et al., 1997). Since it is quite technical, the de- 
tailed discussion is put in Appendix A, where the unbiasedness condition (14) of estimating func- 
tions is examined for these algorithms under the variance-dependent BSS model. We briefly sum- 
marize the consequences in Table 2. Estimators by all algorithms listed below are derived from 
estimating equations which satisfy the unbiasedness condition at least asymptotically. When the 
other conditions are taken into account, TDSEP/SOBI never works for the variance-dependent BSS 
model, because sources have no lagged auto-correlations. ICA algorithms using non-Gaussianity 
such as FastICA and JADE are not working, if sources are Gaussian. The double blind algorithm 
(Hyvärinen and Hurri, 2004) cannot be applied to the case where the variance structures of sources 
are the same or there is no temporal variance-dependency. The nonstationary algorithm by Pham 
and Cardoso (2000) is not applicable to the case where time courses of the activity levels are pro- 
portional to each other. Of course, such a theoretical analysis tells us only about the possibility of 
failure. In practice, algorithms do not always return valid answers, because of local minima and 
numerical instability of their learning process. Nevertheless, this theoretical analysis can explain 
the results of our numerical experiments in the next section. 


6. Numerical Experiments 


We carried out experiments with several artificial and more realistic data sets for several BSS al- 
gorithms. The eight batch algorithms and the online versions of the quasi maximum likelihood 
methods listed in Table 3 were applied to those data sets. Note that our goal is not primarily an 
algorithm comparison but the experiments serve to demonstrate the correctness of our theoretical 
analysis. 


For evaluating the results, we used the index defined by Amari et al. (1996) 


n te NE n nid, 
Amarilndex(B,A*) = )° { Y= Gil i} ry { tl d i\. 
j=1 


2 | maxz ICix| maxx |Ckj 








7. This different result corrects a calculation in Amari et al. (1997). 
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algorithm unbiasedness_ | unavailable cases 

FastICA yes Sources are Gaussian. 
Hyvarinen (1999) 

double blind asymptotically | Variance structures are same or 
Hyvärinen and Hurri (2004) there is no temporal variance-dependency. 

JADE asymptotically | Sources are Gaussian. 
Cardoso and Souloumiac (1993) 

TDSEP/SOBI yes always (since we consider here 
Ziehe and Miiller (1998) only the case without auto-correlations) 
Belouchrani et al. (1997) 

nonstationary yes Time course of the activity levels are 
Pham and Cardoso (2000) proportional to each other. 





Table 2: Availability of other ICA and BSS algorithms 











QML(tanh) quasi maximal likelihood method with the hyperbolic tangent nonlinearity 
QML(pow3) quasi maximal likelihood method with the cubic nonlinearity 
Online(tanh) online version of QML(tanh) with learning rate y(t) = (1/20) 
Online(pow3) online version of QML(pow3) with learning rate y(t) = (41/20) 
*DoubleBlind’ | the double blind algorithm by Hyvärinen and Hurri (2004) 

JADE JADE algorithm 

FastICA(tanh) | FastICA with the hyperbolic tangent nonlinearity 

FastICA(pow3) | FastICA with the cubic nonlinearity 

TDSEP/SOBI | TDSEP/SOBI algorithm 
’Sepagaus’ The ’sepagaus’ algorithm for nonstationary signals 
by Pham and Cardoso (2000) 


0.1 


0.25 
t 








Table 3: ICA and BSS algorithms used in the experiments 
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where A* is the true mixing matrix and C = BA*. If B = PD(A*)~! with a permutation matrix P and 
a diagonal matrix D, then AmariIndex(B,A*) = 0. 


6.1 Artificial Data Sets 


In all artificial data sets, five source signals of various types with length T = 10000 were generated 
and data after multiplying a random 5 x 5 mixing matrix were observed. We made 100 replications 
for each setting and compute the demixing matrix for each replication. The first data set was made 
according to the experiments in Hyvärinen and Hurri (2004). The activity levels v(t) were generated 
from a multivariate AR(1) model, where outliers larger than three times standard deviations from 
the means were reduced to these bounds. The normalized signals z;’s were i.i.d. sub-Gaussian ran- 
dom variables which are signed fourth-order roots of zero-mean uniform variables. The medians of 
the 100 replications are summarized in the row ’ar_subG’ of Table 4 with the measure of deviation 
(3rd-quantile — Ist-quantile)/2. As was pointed out by Hyvärinen and Hurri (2004), only ’ Double- 
Blind’ gave small AmariIndex. Because the marginal distribution of the source signal s;(t) looks 
like a Gaussian, all algorithms based on indices favouring non-Gaussianity failed. Even though 
the determinant in the left hand side of (28) is close to 0, all the assumptions are satisfied and the 
local consistency theorem is still valid. However, this does not directly mean that the estimated 
demixing matrix converges globally to the true one. In this case, many local optima can make the 
algorithms fail. This could also be understood from the fact that the contrast functions based on 
non-Gaussianity become almost flat and thus are very difficult to optimize. In the experiments, we 
observed that part of the true sources were often extracted correctly. 


Although all the algorithms except for ’DoubleBlind’ did not work for the first difficult exam- 
ple, the theoretical study in principle tells that many ICA and BSS algorithms are also applicable 
to the variance-dependent BSS problem. So in fact the failure of the algorithms except ’Double- 
Blind’ can be solely explained by the particular choice of the data set which is in contrast to prior 
findings in Hyvärinen and Hurri (2004). In the second example, uniform random variables were 
used as z;’s instead of sub-Gaussian ones. The marginal distribution of the source signal s;(t) looks 
Laplacian. Therefore, as was shown in the row ’ar_uni’ of Table 4, the algorithms QML(tanh) and 
FastICA(tanh), which are suitable for super-Gaussian sources, always give correct answers. The 
algorithms ’DoubleBlind’, JADE and FastICA(pow3) based on 4-th order moments also worked ex- 
cept several failures due to outliers. We got admissible results by the nonstationary BSS algorithm 
*Sepagaus’, if an appropriate smoothing window was chosen. 


In the third and the fourth data, the activity levels v;(t) are sinusoidal functions with different 
frequencies. 


B _ (03 +i)nt S 
w(t) =1+0.9sin (OE F bday 


For the normalized signals z;, Laplacian and the sub-Gaussian i.i.d. random variables were used 
in the third and the fourth examples, respectively. In the super-Gaussian case (the row ’sin_supG’ 
of Table 4), the six algorithms except QML(pow3) and TDSEP/SOBI worked properly. *Sepa- 
gaus’ showed best performance, and QML(tanh) and FastICA(tanh) based on the hyperbolic tan- 
gent nonlinearity gave better results than ’DoubleBlind’, JADE and FastICA(pow3) with 4-th order 
moments. On the other hand, in the sub-Gaussian case (the row ’sin_subG’ of Table 4), the six algo- 
rithms except QML(tanh) and TDSEP/SOBI returned admissible results. *Sepagaus’ also showed 
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QML(tanh) QML (pow3) Online(tanh) Online(pow3) | ’DoubleBlind’ 
ar_subG 8.25 (1.85) | 11.32 (2.84) 14.59 (2.29) | 14.75 (2.32) 0.52 (0.10) 
ar_uni 0.30 (0.04) | 27.77 (0.32) 0.51 (0.08) | 23.40 (1.88) 0.70 (0.16) 
sin_supG 0.17 (0.02) | 29.97 (0.26) 0.39 (0.05) | 28.74 (0.96) 0.79 (0.13) 
sin_subG 19.21 (0.24) 0.32 (0.05) 21.51 (2.08) 0.57 (0.30) 0.27 (0.03) 
com_supG 0.39 (0.06) | 28.37 (0.27) 0.64 (0.09) | 25.67 (1.74) 6.45 (1.56) 
com_subG || 26.53 (0.55) 0.14 (0.02) 27.00 (2.41) 0.28 (0.05) | 22.05 (1.96) 
exp_supG 0.35 (0.05) | 28.43 (0.45) 0.59 (0.07) | 22.84 (2.06) 7.63 (1.88) 
uni_subG || 27.38 (0.17) 0.13 (0.02) 27.24 (1.27) 0.27 (0.04) 18.56 (1.66) 
SSS 0.03 3.82 0.06 (0.01) 2.79 (0.53) 0.02 
v12 0.01 3.73 0.06 2.89 (0.04) 0.21 
JADE FastICA(tanh) | FastICA(pow3) | TDSEP/SOBI ’ Sepagaus’ 
ar_subG 10.79 (1.88) 9.25 (1.98) 12.52 (2.05) | 15.07 (1.96) 1.19 (0.48) 
ar_uni 0.66 (0.14) 0.38 (0.05) 0.73 (0.14) | 14.92 (2.37) 0.85 (0.22) 
sin_supG 0.43 (0.07) 0.23 (0.03) 0.41 (0.07) | 15.31 (2.04) 0.08 (0.01) 
sin_subG 0.31 (0.04) 0.68 (0.14) 0.33 (0.05) | 15.70 (1.94) 0.08 (0.01) 
com_supG 0.84 (0.16) 0.48 (0.07) 0.87 (0.14) | 16.02 (2.05) 1.28 (0.19) 
com_subG || 26.49 (0.86) | 27.04 (0.38) 26.65 (0.17) | 16.23 (2.01) | 27.08 (0.40) 
exp_supG 1.24 (0.23) 0.44 (0.06) 1.20 (0.22) | 16.47 (1.81) 1.28 (0.20) 
uni_subG 0.17 (0.03) 0.18 (0.03) 0.18 (0.03) | 16.20 (1.78) | 27.08 (0.33) 
SSS 0.02 0.19 (0.04) 0.09 (0.01) 0.01 0.01 
v12 0.19 0.17 (0.02) 0.08 (0.09) 0.14 0.01 








Table 4: Amarilndex of the estimators. The values are the medians of 100 replications with the 
measure of deviation, (3rd-quantile — 1st-quantile) /2 


best performance, and all four algorithms with 4-th order moments showed better performance than 
the FastICA(tanh). 

The double blind algorithm (DoubleBlind’) by Hyvärinen and Hurri (2004) does not work 
when (i) all v;’s have same temporal structure, and (ii) there exist no temporal dependencies in v;’s. 
*Sepagaus’ does not have a guarantee to separate sources either, because smoothed sequences of the 
activity levels are nearly proportional to each other (see Table 2). The fifth and sixth data set are 
examples of the case (i), where v;(t) are the same sinusoidal functions. 


v(t) = 1+0.9sin (Z5), 


E 
500 Po anne 


As in the third and the fourth examples, Laplace and the sub-Gaussian i.i.d. random variables were 
used for the normalized signals z;. As in the row ’com_supG’ of Table 4, the five algorithms ex- 
cept QML(pow3), ’DoubleBlind’ and TDSEP/SOBI worked properly. Among them, QML(tanh) 
and FastICA(tanh) had better performance. ’DoubleBlind’ gave poor results, because the matrix 
K; A cov{s; (),s4C — 1)} is almost singular. In the sub-Gaussian case, it looks quite difficult to 
distinguish the sources visually. Unfortunately, we could not demix them correctly except with 


QML(pow3) as shown in the row ’com_subG’ of Table 4. In order to check why other algorithms 


471 


KAWANABE AND MULLER 


with the local consistency did not work, we carried out extra experiments with larger sample size 
T. When T = 200000, the AmariIndices of the estimated demixing matrices by JADE are below 
0.11, 92 times out of 100 repetition. On the other hand, both FastICA methods returned valid re- 
sults almost always (AmariIndices are below 0.22, 89 times for FastICA(tanh) and 100 times for 
FastICA(pow3) ), if T = 50000 and the algorithms start from the true demixing matrix. Therefore, 
we think that the global convergence is not achieved in these cases, because of finite sample size 
effects and local optima. 

The seventh and the eighth data sets are examples of the case (ii), where v(t) is i.i.d. in time ¢. In 
the former example, we transform 5 independent exponential random variables linearly such that v; 
and v; have correlation 0.9, and z;’s were i.i.d Laplace random variables. On the other hand, in the 
latter example, v(t) was generated from 5 uniform random variables by the same linear transforma- 
tion and z;’s were the i.i.d sub-Gaussian random variables. As one can see in the row ’exp_supG’ of 
Table 4, the results are similar to the data set ’com_supG’. On the other hand, in the sub-Gaussian 
case summarized in the row ’uni_subG’ of Table 4, QML(pow3), JADE, FastICA(tanh) and Fas- 
tICA(pow3) gave correct results, but °’ Sepagaus’ showed very poor performance. We remark that in 
both cases, ’DoubleBlind’ did not work as was expected. 

We would now like to digest the results from Table 4 and relate them to our theoretical find- 
ings. We have shown that all algorithms except for TDSEP/SOBI have the local consistency for 
most of the given data. However, this does not directly mean that they converge globally to the true 
solution. Although we hope that algorithms with a local consistency work properly, we sometimes 
see significant deviations from this expectation in practice as in Table 4. The algorithmic failures 
are caused by local optima as pointed out above for the data set ’ar_subG’, or more importantly to 
numerical stability and convergence issues. For example, since learning algorithms like gradient 
descent are used for QML(tanh) and QML(pow3), desired solutions (equilibria) turn out to be in- 
stable for sub-Gaussian (QML(tanh)) and super-Gaussian signals (QML(pow3)). In our data sets, 
’ar_uni’, all data sets with ’supG’ and acoustic signals are super-Gaussian, while all data sets with 
*subG’ except ’ar_subG’ are sub-Gaussian. One can see the clear pattern in the columns QML(tanh) 
and QML(pow3). The online version Online(tanh) and Online(pow3) had slightly degraded per- 
formance with appropriate learning rate, if the batch version QML(tanh) and QML(pow3) worked, 
respectively. On the other hand, although FastICA uses similar criteria for non-Gaussianity, it em- 
ploys a kind of Newton’s method and so the desired solutions are automatically better stabilized. In 
the columns FastICA(tanh) and FastICA(pow3), except for the difficult case °com_subG’ and nearly 
Gaussian case ’ar_subG’, both algorithms succeeded. 


6.2 Variance-Dependent Speech Signals 


Next we will deal with more realistic data sets. Speech and audio signals have often been used 
as sources s(t) even for experiments of the instantaneous ICA model. In order to check whether 
variance-dependency matters to many ICA and BSS algorithms, we applied BSS algorithms to 
speech signals which have strong variance-dependency. 

In the first experiments ’sss’,® we took two speech signals with length T = 120976, where 
one speaker says digits from 1 to 10 in English, and the other speaker counts at the same time in 
Spanish. We used the separated signals of their second demo as the sources, because their separation 
quality is good enough. Figure 2 shows the sources and the estimators of their activity levels with 





8. The signals were downloaded from http: //inc2.ucsd.edu/~tewon/. 
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an appropriate smoother. We inserted one short pause at different positions of both sequences to 
make correlation of the activity levels of the modified signals much larger (0.65). In the second 
experiments ’v12’,? we took two speech signals from Japanese text (T = 48000). Figure 3 shows 
the sources and the estimators of the activity levels. We extended and shorten each syllable of the 
second sequence and tuned its amplitude such that the two sources have high variance-dependency. 
Correlation of the activity levels of the arranged signals becomes 0.74. 


40000 80000 120000 


[Ste Apt AMS A 


40000 80000 120000 




















Figure 2: The sources of the data set ’sss’ and the estimators of their activity levels. The upper 
panel contains the signals showing counting from 1 to 10 in English and Spanish. The 
lower panel shows their activity levels with an appropriate smoother. 
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Figure 3: The sources of the data set ’v12’ and the estimators of their activity levels. The upper 
panel are signals from Japanese sentences. The lower panel shows their activity levels 
with an appropriate smoother. 


A 2 x 2 mixing matrix A was randomly generated 100 times and 100 different mixtures of the 
source signals were made. The results are summarized in the rows ’sss’ and ’v12’ of Table 4. In 





9. The signals can be downloaded by http: //www.islab.brain.riken.go.jp/~mura/ica/vl.wav and v2.wav. 
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both experiments, QML(tanh), JADE, TDSEP/SOBI and ’Sepagaus’ always worked, while Fas- 
tICA(tanh) and FastICA(pow3) gave admissible results except for several cases. Although TD- 
SEP/SOBI is not applicable to the variance-dependent BSS model, it also returned correct results. 
This means that the speech signals are not perfectly matching the model Eq. (4), but the sources 
have furthermore a lagged autocorrelation. QML(tanh) always returned wrong answers, because 
speech is usually super-Gaussian. 


7. Conclusions 


In this paper, we discussed semiparametric estimation for blind source separation, when sources 
have variance dependencies. Hyvärinen and Hurri (2004) introduced the double blind setting where, 
in addition to source distributions, dependencies between components are not restricted by any 
parametric model. In the presence of these two nuisance parameters (densities of activity level and 
underlying signal), they proposed an algorithm based on lagged 4-th order cumulants. Although 
their algorithm works well in many cases, it fails if (4) all v;’s have similar temporal structure, or (ii) 
there exist no temporal dependencies in v;’s. Furthermore it also suffers from outliers. 


Extending the semiparametric approach (Amari and Cardoso, 1997) under variance dependen- 
cies, we investigated estimating functions for the variance-dependent BSS model. In particular, we 
proved that the quasi maximum likelihood estimator is derived from an estimating function, and is 
hence consistent regardless of the true nuisance densities (which satisfy certain mild conditions). 
We also analyzed other ICA algorithms within the framework of (asymptotic) estimating functions 
and showed that many of them can separate sources with coherent variances. This is in contrast 
to previous understanding of the mechanisms underlying ICA algorithms. Theoretically we have 
shown that at least asymptotically all BSS algorithms except for TDSEP/SOBI have the local con- 
sistency, thus they should succeed on a given mixed data. However, local consistency does not 
necessarily guarantee global convergence to the true solution and we sometimes see significant de- 
viations from this expectation in practice. The algorithmic failures are due to many local optima 
and more importantly due to numerical stability and convergence issues. 


Although almost all ICA and BSS algorithms could not give correct answers in the numerical 
experiment of Hyvärinen and Hurri (2004), we showed here that this was mainly a matter of the 
specific choice of the data set. In fact, most ICA and BSS algorithms also work well in many other 
benchmark examples that use dependent data. In particular, we carried out two experiments with 
highly variance-dependent speech signals. Despite the dependence typically found in speech, most 
ICA and BSS algorithms yield excellent separation results and our theoretical analysis can help 
to understand the reason for this fact. We conjecture that it is not the coarse amplitude structure 
(e.g. from dependence) that matters for BSS but the statistical fine structure of the signals. 


In this paper, we only tested existing ICA and BSS algorithms and pointed out that some of 
them are applicable to the variance-dependent BSS model. Future research will go one step further 
and construct more efficient or robust semiparametric algorithms. Note also that in practice, it is 
important to analyze how to select the best BSS method for a specific, say, variance-dependent data 
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set. We think that suitable methods might be developed along the lines of Meinecke et al. (2002) or 
Harmeling et al. (2004). 
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Appendix A. Comments on Other Selected BSS Algorithms 


We will discuss in the following the local consistency of ICA/BSS algorithms except the quasi 
maximum likelihood method. 


A.1 FastICA 


FastICA is one of the standard algorithms for blind source separation. Let u(t) = C~!/?x(t) be the 
whitened data, where C = +)/_, x(t)x'(r) is the sample covariance. FastICA gives the demixing 
matrix W = (w1,.-.,Wn)! which maximizes the total non-Gaussianity 


= 


1 


LT G{w; u(t)} 


i 
IMs 


t=1 


under the orthogonality condition WW ' =I. We use, in the following the notation W for the demix- 
ing matrix after whitening in order to distinguish it from the total demixing matrix B = WC~!/? 
including whitening process. Here G is a nonlinear function which is introduced to approximate the 
negentropy (Hyvärinen et al., 2001b). By solving the constrained optimization problem, we see that 
the estimator of W must satisfy the estimating equation 
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boy (t)-I+y(t)g' DE -spe e) =0 (37) 
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where y(t) = Wu(t). If we write the total demixing matrix as B = WC~'/?, y(t) can be expressed as 
Bx(t). The vector function g(y) consists of the derivatives g(y;) = G’ (y;), that is, gy) = (81), ---,8On)) 
The functions (12) and (13) are also used as the function g. We remark that the equation (37) is 
equivalent to 
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because the left hand side of (38) is antisymmetric, while that of (39) is symmetric. If we determine 
the scales of the sources such that 


e Eels ae 





Pn =T, i=1,...,n, (40) 


then it is easy to show that the expectations of the left hand side of (38) and (39) vanish regardless 
of the nuisance functions p; and py, in the same way as for the quasi maximum likelihood method. 
This means that the left hand side of (37) satisfies the unbiasedness condition (14) of estimating 
functions. If the other regularity conditions hold, it becomes an estimating function and the esti- 
mator B derived from it converges to the correct demixing matrix B* = (A*)~! with a permutation 
matrix P and a diagonal matrix D. Although the estimating function is similar to that of the quasi 
maximum likelihood, FastICA algorithm is based on the Newton’s algorithm, and therefore, it has 
globally more stable dynamics than the natural gradient learning. 


A.2 The Double Blind Algorithm by Hyvarinen and Hurri (2004) 


Hyvärinen and Hurri (2004) proposed an algorithm for separating sources under the double blind 
situation. The estimator is obtained by maximizing 


IW) =F [covfy?(-),97(-— an}, 
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under the orthogonality condition WW ' = J, where 
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that is, the empirical cumulants of the source signal s(t) = (A*)~!x(r) converge to their expectation, 


where 
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By ignoring higher-order terms, we get 


= (da Krd) 3) 
jkl 


where Q = (qij) = BA* and B = WC -1/2 indicates the demixing matrix without whitening. Pro- 
vided that the matrix K = (Kj;) is non-singular, the quantity J is maximized when Q is a signed 
permutation matrix, that is, by maximizing the criterion J we can estimate the true demixing ma- 
trix B* = (A*)~! up to signed permutation matrices. This also means that the algorithm does not 
work if there is no temporal covariance dependencies (for example, the data sets ’exp_supG’ and 
*uni_subG’ in our experiment), or all sources have exactly same temporal covariance dependencies 
(for example, the data sets °com_supG’ and ’com_subG’ in our experiment). 

Although the authors have already given its validity as mentioned above, we will check its 
estimating equation. By solving the constrained optimization problem, we see that the estimator is 
obtained from the estimating equation 











F(X,B) =0, (41) 
where 
$ T T 
Fij L iyt) — Siy} + D L(K Ri — Kj yr (t — At )y:(t)y;(t) 
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+E Ri- Riy Eyit — At)yj(t — Ar) (42) 
l 
and Ri; = cov{y?(-),y;(- — At) }. By replacing Rij with Kj;, let us define the function 
T 
Fi(X,B) = Lov yilt)— Sih L [E Kai- Kii E- Arye) 
t=At+1 | 1 
+h Kyi — Kij)yi (t)yi(t — At)y;(t— Ar) | . (43) 





Suppose that F(X,B) is an estimating function which fulfills F(X,B) = O,(T'/), when B is the 
true parameter. If the function F(X, B) satisfies 


F(X,B) = F(X,B) +0,(T!/?) (44) 


for any B such that ||B — B|| = O(T~'/?), it can be shown that the residual does not matter to 
the asymptotic property of the estimator and the solution B of (41) is asymptotically equivalent 
to that of the equation F(X,B) = 0 (see Cardoso, 1997). In fact, we can prove (44) under mild 
conditions, that is, the difference between the functions (42) and (43) can be neglected. Therefore, 
we will check whether F(X, B) actually satisfies the conditions of estimating functions. If we take 
the constraints (40) to determine the scales of the sources, the unbiasedness condition (14) follows 
from uncorrelatedness of the sources and 
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T 
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We remark that the expectations are taken with respect to p(X|B,pz,Pv), and therefore 

y(t) = Bx(t) = s(t) holds. If the other regularity condition holds, F(X ,B) turns out to be an asymp- 
totic estimating function which is asymptotically equivalent to an estimating function and the esti- 
mator B converges to the correct demixing matrix B* = (A*)~!. 


A.3 JADE 


Although in a rigorous sense, the asymptotic properties of JADE should be analyzed as in the 
previous section (see also Cardoso, 1997), its consistency can be shown more easily (as suggested 
by one of the anonymous reviewers). Suppose that the contrast function of JADE 


Jave(W)= } [Yiyi Yey) 
ijklŻiikl 


uniformly converges to the ideal contrast function 


Jaw W)= $ |cumoiyj yey) 
ijkl Ziükl 


on the set of orthogonal matrices W such that WW! = I, where cum and cum denote the empirical 
and the expected cumulant tensor, respectively. Then, the minimum of the Jyape(W ) converges 
to that of Jj,pp(W). If W is the true demixing matrix and y;’s are extracted signals with W, the 
components Kj jx := cum(y;, Y j, Yk, Y1) of the expected cumulant are zero except for i = j =k = l or 
i= j#k=lori=l+ j= k. Thus, one needs only to show that the estimating equation is associated 
to the minimization of 2 >: [cum(y;, yj, Yi, yj) |7 under the orthogonality constraints which is satisfied 
l 
when y; equals the true i (up to a scaling and a permutation). The estimating equation is 
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which can be seen to be satisfied, when y;’s equal to the true sources. We remark that the same 
formula as (45) can be obtained after the rigorous analysis. The function which is associated with 
the asymptotic estimating function (see (43)) becomes 
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A.4 TDSEP/SOBI 


Let us define lagged covariance matrices of x(t) 


T 


R(At) = > YE fxs a0) 
t=At+ 
ay ee y E [s()s TC Ar)| ANI 
= ag È B[sios ean] pan 


When the sources s;’s are mutually independent and have temporal covariance structure, the demix- 
ing matrix PD(A*)~! can diagonalize all lagged covariance matrices R(Ar), where P is a permutation 
matrix and D is a diagonal matrix. This property has been used in blind separation methods with 
second order statistics (Tong et al., 1991; Belouchrani et al., 1997; Ziehe and Miiller, 1998). 


In the variance-dependent BSS model, for any i, j, t and At > 1, 


E[s;(t)s;(t — At)] = E[v,(t)v;(t — At)] E [z:(t)z;(t —At)] = 0. 


Therefore, R(At) = 0 for Ar > 1, that is, we cannot get any information about the mixing matrix A 
from lagged covariance matrices R(At). This is why TDSEP does not work for this model. 
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